{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [spaCy](http://spacy.io/docs/#examples) introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load spaCy resources"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Import spacy and English models\n",
    "import spacy\n",
    "\n",
    "nlp = spacy.load('en')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading spaCy can take a while, in the meantime here are a few definitions to help you on your NLP journey.\n",
    "\n",
    "#### What are Stop Words?\n",
    "\n",
    "Stop words are the common words in a vocabulary which are of little value when considering word frequencies in text. This is because they don't provide much useful information about what the sentence is telling the reader.\n",
    "\n",
    "Example: _\"the\",\"and\",\"a\",\"are\",\"is\"_\n",
    "\n",
    "#### What is a Corpus?\n",
    "\n",
    "A corpus (plural: corpora) is a large collection of text or documents and can provide useful training data for NLP models. A corpus might be built from transcribed speech or a collection of manuscripts. Each item in a corpus is not necessarily unique and frequency counts of words can assist in uncovering the structure in a corpus.\n",
    "\n",
    "Examples:\n",
    "\n",
    "1. Every word written in the complete works of Shakespeare\n",
    "2. Every word spoken on BBC Radio channels for the past 30 years "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Process text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Process sentences 'Hello, world. Natural Language Processing in 10 lines of code.' using spaCy\n",
    "doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get tokens and sentences\n",
    "\n",
    "#### What is a Token?\n",
    "A token is a single chopped up element of the sentence, which could be a word or a group of words to analyse. The task of chopping the sentence up is called \"tokenisation\".\n",
    "\n",
    "Example: The following sentence can be tokenised by splitting up the sentence into individual words.\n",
    "\n",
    "\t\"Cytora is going to PyCon!\"\n",
    "\t[\"Cytora\",\"is\",\"going\",\"to\",\"PyCon!\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get first token of the processed document\n",
    "token = doc[0]\n",
    "print(token)\n",
    "\n",
    "# Print sentences (one sentence per line)\n",
    "for sent in doc.sents:\n",
    "    print(sent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part of speech tags\n",
    "\n",
    "#### What is a Speech Tag?\n",
    "A speech tag is a context sensitive description of what a word means in the context of the whole sentence.\n",
    "More information about the kinds of speech tags which are used in NLP can be [found here](http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/).\n",
    "\n",
    "Examples:\n",
    "\n",
    "1. CARDINAL, Cardinal Number - 1,2,3\n",
    "2. PROPN, Proper Noun, Singular - \"Matic\", \"Andraz\", \"Cardiff\"\n",
    "3. INTJ, Interjection - \"Uhhhhhhhhhhh\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# For each token, print corresponding part of speech tag\n",
    "for token in doc:\n",
    "    print('{} - {}'.format(token, token.pos_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visual part of speech tagging ([displaCy](https://displacy.spacy.io))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Syntactic dependencies\n",
    "\n",
    "#### What are syntactic dependencies?\n",
    "\n",
    "We have the speech tags and we have all of the tokens in a sentence, but how do we relate the two to uncover the syntax in a sentence? Syntactic dependencies describe how each type of word relates to each other in a sentence, this is important in NLP in order to extract structure and understand grammar in plain text.\n",
    "\n",
    "Example:\n",
    "\n",
    "<img src=\"images/syntax-dependencies-oliver.png\" align=\"left\" width=500>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Write a function that walks up the syntactic tree of the given token and collects all tokens to the root token (including root token).\n",
    "\n",
    "def tokens_to_root(token):\n",
    "    \"\"\"\n",
    "    Walk up the syntactic tree, collecting tokens to the root of the given `token`.\n",
    "    :param token: Spacy token\n",
    "    :return: list of Spacy tokens\n",
    "    \"\"\"\n",
    "    tokens_to_r = []\n",
    "    while token.head is not token:\n",
    "        tokens_to_r.append(token)\n",
    "        token = token.head\n",
    "        tokens_to_r.append(token)\n",
    "\n",
    "    return tokens_to_r\n",
    "\n",
    "# For every token in document, print it's tokens to the root\n",
    "for token in doc:\n",
    "    print('{} --> {}'.format(token, tokens_to_root(token)))\n",
    "\n",
    "# Print dependency labels of the tokens\n",
    "for token in doc:\n",
    "    print('-> '.join(['{}-{}'.format(dependent_token, dependent_token.dep_) for dependent_token in tokens_to_root(token)]))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Named entities\n",
    "\n",
    "#### Named Entities\n",
    "\n",
    "A named entity is any real world object such as a person, location, organisation or product with a proper name. \n",
    "\n",
    "Example:\n",
    "\n",
    "\t1. Barack Obama\n",
    "\t2. Edinburgh\n",
    "\t3. Ferrari Enzo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Print all named entities with named entity types\n",
    "\n",
    "doc_2 = nlp(u\"I went to Paris where I met my old friend Jack from uni.\")\n",
    "for ent in doc_2.ents:\n",
    "    print('{} - {}'.format(ent, ent.label_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Noun chunks\n",
    "\n",
    "#### What is a Noun Chunk?\n",
    "Noun chunks are the phrases based upon nouns recovered from tokenized text using the speech tags.\n",
    "\n",
    "Example:\n",
    "\n",
    "The sentence \"The boy saw the yellow dog\" has 2 noun objects, the boy and the dog. \n",
    "Therefore the noun chunks will be\n",
    "\n",
    "\t1. \"The boy\"\n",
    "\t2. \"the yellow dog\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Print noun chunks for doc_2\n",
    "print([chunk for chunk in doc_2.noun_chunks])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unigram probabilities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# For every token in doc_2, print log-probability of the word, estimated from counts from a large corpus \n",
    "for token in doc_2:\n",
    "    print(token, ',', token.prob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word embedding / Similarity\n",
    "\n",
    "#### What are Word embeddings?\n",
    "\n",
    "A word embedding is a representation of a word, and by extension a whole language corpus, in a vector or other form of numerical mapping. This allows words to be treated numerically with word similarity represented as spatial difference in the dimensions of the word embedding mapping.\n",
    "\n",
    "Example:\n",
    "\t\n",
    "With word embeddings we can understand that vector operations describe word similarity. This means that we can see vector proofs of statements such as:\n",
    "\n",
    "\tking-queen==man-woman"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# For a given document, calculate similarity between 'apples' and 'oranges' and 'boots' and 'hippos'\n",
    "doc = nlp(u\"Apples and oranges are similar. Boots and hippos aren't.\")\n",
    "apples = doc[0]\n",
    "oranges = doc[2]\n",
    "boots = doc[6]\n",
    "hippos = doc[8]\n",
    "print(apples.similarity(oranges))\n",
    "print(boots.similarity(hippos))\n",
    "\n",
    "print()\n",
    "# Print similarity between sentence and word 'fruit'\n",
    "apples_sent, boots_sent = doc.sents\n",
    "fruit = doc.vocab[u'fruit']\n",
    "print(apples_sent.similarity(fruit))\n",
    "print(boots_sent.similarity(fruit))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
